Introduction

Dataset

Description of Variables

“First Attempts” Geo Plots

Baseline EDA of Income and Population

Population Histogram and QQ

A baseline analysis of population and income was conducted. The population histogram was skewed to the right. Census tracts had similar population counts, with a mean of about 4,000, whereas county populations varied widely: some counties had about 1 million residents and others about 10 million. Because tract populations were comparable, census tracts were easier to investigate than counties. The Q-Q plot confirmed the non-normality, as the points between the third and fourth quartiles fell far from the line.

Income Histogram and QQ

The raw income data were also strongly skewed to the right. The data appeared to follow a power-law curve, since some individuals have amassed a large amount of income and these outliers skew the distribution. The outliers and NA values were therefore removed; on rechecking, the “cleaned” data appeared approximately normal. The histogram was unimodal, and the points on the Q-Q plot did not stray from the line.

## [1] "26139.73"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     128   18776   24730   26140   32247   56040    3589
## [1] 10274.98
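
The text does not say which rule was used to drop the outliers. The following is a minimal sketch of one common choice, assuming a 1.5 × IQR fence; the vector `income` is simulated and only stands in for the income-per-capita column.

```r
# Simulated right-skewed incomes standing in for income per capita (hypothetical data)
set.seed(1)
income <- c(rlnorm(5000, meanlog = 10, sdlog = 0.5), rep(NA, 200))

# Drop missing values first
income_clean <- income[!is.na(income)]

# Assumed outlier rule: keep values within 1.5 * IQR of the quartiles
q <- quantile(income_clean, c(0.25, 0.75))
fence <- 1.5 * diff(q)
keep <- income_clean >= q[1] - fence & income_clean <= q[2] + fence
income_clean <- income_clean[keep]

# Inspect the cleaned distribution
par(mfrow = c(1, 2))
hist(income_clean, breaks = 50, main = "Cleaned income", xlab = "Income per capita")
qqnorm(income_clean); qqline(income_clean)
```

A fence of this kind trims only the extreme tails; a log transform would be an alternative that keeps every observation.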

Individual EDA of Freedom scores

Next, the seventeen independent variables were analyzed, starting with the five freedom scores: economic freedom, personal freedom, regulatory policy, fiscal policy, and overall freedom. For each score, the observations were split into four quartile groups, and the box plots showed income per capita within each group. In all five sets of box plots, there did not appear to be any differences between the quartiles, as the boxes all overlapped over roughly the same range. The histograms did not appear normal: the data were scattered, with large gaps between bins. The Q-Q plots told a similar story, as the points followed a sine-like pattern around the line with heavy tails at both ends. None of the freedom scores appeared to be normally distributed.

## Observations per group: 18417, 17756, 22780, 13244. 1064 missing.
##  Factor w/ 4 levels "[-0.827,-0.256]",..: 3 3 3 3 3 3 3 3 3 3 ...

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
## -0.8272 -0.2556  0.0152 -0.0866  0.1376  0.3550    1064

## Observations per group: 18061, 18704, 20888, 14544. 1064 missing.
##  Factor w/ 4 levels "[-0.0444,0.0135]",..: 1 1 1 1 1 1 1 1 1 1 ...

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
## -0.0444  0.0135  0.0803  0.0641  0.1064  0.2450    1064

## Observations per group: 19180, 17739, 17836, 17442. 1064 missing.
##  Factor w/ 4 levels "[-0.457,-0.223]",..: 3 3 3 3 3 3 3 3 3 3 ...

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
## -0.4569 -0.2228 -0.0737 -0.1511 -0.0322  0.0715    1064

## Observations per group: 18164, 19958, 16580, 17495. 1064 missing.
##  Factor w/ 4 levels "[-0.37,-0.0602]",..: 3 3 3 3 3 3 3 3 3 3 ...

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
## -0.3702 -0.0602  0.0634  0.0646  0.1767  0.4024    1064

## Observations per group: 18440, 17906, 18395, 17456. 1064 missing.
##  Factor w/ 4 levels "[-0.814,-0.102]",..: 2 2 2 2 2 2 2 2 2 2 ...

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
## -0.8136 -0.1021  0.0652 -0.0224  0.1637  0.4614    1064
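
The quartile grouping behind these box plots can be sketched in base R as follows. This is an illustration under assumptions, not the original code: the data frame `dat` and its columns `score` and `income` are simulated stand-ins, and the report may have used a different helper to produce the "Observations per group" lines.

```r
# Hypothetical tract-level data: a freedom score and income per capita
set.seed(2)
dat <- data.frame(
  score  = runif(1000, -0.8, 0.4),
  income = rnorm(1000, mean = 26000, sd = 10000)
)

# Break points at the sample quartiles of the score
breaks <- quantile(dat$score, probs = seq(0, 1, 0.25), na.rm = TRUE)

# include.lowest = TRUE closes the first interval: "[a,b]", "(b,c]", ...
dat$score_q <- cut(dat$score, breaks = breaks, include.lowest = TRUE)

# Observations per group, then income by score quartile
table(dat$score_q, useNA = "ifany")
boxplot(income ~ score_q, data = dat,
        xlab = "Score quartile", ylab = "Income per capita")
```

When a variable takes few distinct values, as a state-level score repeated across tracts would, ties at the break points make the four groups unequal in size.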

Individual EDA of Work Variations

Next, the seven work-variation variables (professional, production, unemployment, office, service, construction, and self-employed) were assessed for normality. The boxplots for unemployment, service, construction, and production showed income decreasing as the share of that work variation in the census tract increased. For example, as the share of unemployed individuals in a census tract rose, income per capita fell. The only work variation associated with an increase in average income was professional work. Office and self-employed remained relatively stable across quartiles. In the histograms, only the proportion of professionals appeared normally distributed; the remaining six work variations were all skewed to the right. The Q-Q plot for professionals supported normality, as the points stayed close to the line with very small left and right tails. The other variables each had an oversized right tail and a relatively small left tail. Overall, the proportion of professionals appeared normally distributed, while the other work variations did not.

## Observations per group: 18765, 18330, 18095, 17970. 101 missing.
##  Factor w/ 4 levels "[0,5.1]","(5.1,7.7]",..: 2 4 2 3 1 3 3 3 3 2 ...

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   5.100   7.700   9.028  11.400 100.000     101

## Observations per group: 18379, 18323, 18173, 18281. 105 missing.
##  Factor w/ 4 levels "[0,24.1]","(24.1,32.6]",..: 3 1 2 2 4 2 1 3 2 2 ...

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     0.0    24.1    32.6    34.8    43.8   100.0     105

## Observations per group: 18303, 18828, 17766, 18259. 105 missing.
##  Factor w/ 4 levels "[0,20.1]","(20.1,23.8]",..: 2 2 2 3 1 4 3 4 3 1 ...

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00   20.10   23.80   23.95   27.50  100.00     105

## Observations per group: 18672, 18068, 18275, 18141. 105 missing.
##  Factor w/ 4 levels "[0,13.5]","(13.5,17.9]",..: 2 4 4 3 2 2 4 1 2 2 ...

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     0.0    13.5    17.9    19.1    23.6   100.0     105

## Observations per group: 18588, 18209, 18113, 18246. 105 missing.
##  Factor w/ 4 levels "[0,5]","(5,8.4]",..: 3 3 3 3 1 2 3 2 2 3 ...

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   5.000   8.400   9.295  12.500 100.000     105

## Observations per group: 18566, 18226, 18085, 18279. 105 missing.
##  Factor w/ 4 levels "[0,7.1]","(7.1,11.8]",..: 3 4 3 3 3 3 3 1 3 4 ...

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    7.10   11.80   12.86   17.40  100.00     105

## Observations per group: 19155, 17839, 18195, 17967. 105 missing.
##  Factor w/ 4 levels "[0,3.6]","(3.6,5.5]",..: 2 3 4 1 2 3 1 3 2 3 ...

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   3.600   5.500   6.227   8.100 100.000     105

Individual EDA of Ethnicities

Finally, the five ethnic variables (Native, White, Black, Hispanic, and Asian) were investigated. The boxplots for White showed average income increasing across the first, second, and third quartiles but not changing in the fourth. The boxplots for Asian showed an increase from the first through the fourth quartile. The boxplots for Hispanic increased slightly between the first and second quartiles, did not change in the third, and decreased significantly in the fourth. The boxplots for Black showed average income increasing between the first and second quartiles and then decreasing from the second to the fourth. Overall, average income appeared to change with the concentration of ethnicities in a census tract. The histogram for White was bimodal, with the highest frequency at over 8,000. The histograms for the other four ethnicities were skewed to the right. Based on the histograms, White had the most responses, followed by Hispanic, Black, Asian, and Native. For every ethnicity variable, the points on the Q-Q plot followed a curve with large left and right tails. There were not enough responses from the Native ethnicity to construct a meaningful boxplot, and the Native Q-Q plot showed a clear pattern around the line, implying non-normality. Therefore, based on the boxplots, histograms, and Q-Q plots, none of the ethnicity variables appears normally distributed.

## Observations per group: 18430, 18252, 18303, 18276. 0 missing.
##  Factor w/ 4 levels "[0,0.7]","(0.7,3.7]",..: 3 4 4 2 4 3 4 3 3 3 ...

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    0.70    3.70   13.27   14.40  100.00

## Observations per group: 18472, 18246, 18233, 18310. 0 missing.
##  Factor w/ 4 levels "[0,2.4]","(2.4,7]",..: 1 1 1 3 1 3 2 1 1 1 ...

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    2.40    7.00   16.86   20.40  100.00

## Observations per group: 20124, 17253, 17651, 18233. 0 missing.
##  Factor w/ 4 levels "[0,0.2]","(0.2,1.4]",..: 2 3 2 1 3 1 1 1 1 2 ...

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    0.20    1.40    4.59    4.80   91.30

## Observations per group: 18351, 18331, 18285, 18294. 0 missing.
##  Factor w/ 4 levels "[0,39.4]","(39.4,71.4]",..: 3 2 3 3 2 3 3 3 4 3 ...

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   39.40   71.40   62.03   88.30  100.00

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##   0.0000   0.0000   0.0000   0.7279   0.4000 100.0000

Correlations

!! full2015_varofInterest

## corrplot 0.84 loaded

Scatter Plots

ANOVA

!! full_2015_varOfInterest

## Call:
##    aov(formula = IncomePerCap ~ EthnicPlurality, data = anova_dat)
## 
## Terms:
##                 EthnicPlurality    Residuals
## Sum of Squares     1.632113e+12 5.681161e+12
## Deg. of Freedom               4        69566
## 
## Residual standard error: 9036.911
## Estimated effects may be unbalanced
## 3589 observations deleted due to missingness
##  [1] "coefficients"  "residuals"     "effects"       "rank"         
##  [5] "fitted.values" "assign"        "qr"            "df.residual"  
##  [9] "na.action"     "contrasts"     "xlevels"       "call"         
## [13] "terms"         "model"
##                    Df    Sum Sq   Mean Sq F value Pr(>F)    
## EthnicPlurality     4 1.632e+12 4.080e+11    4996 <2e-16 ***
## Residuals       69566 5.681e+12 8.167e+07                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 3589 observations deleted due to missingness

## Call:
##    aov(formula = IncomePerCap ~ WorkPlurality, data = anova_dat)
## 
## Terms:
##                 WorkPlurality    Residuals
## Sum of Squares   2.642140e+12 4.671134e+12
## Deg. of Freedom             6        69564
## 
## Residual standard error: 8194.432
## Estimated effects may be unbalanced
## 3589 observations deleted due to missingness
##  [1] "coefficients"  "residuals"     "effects"       "rank"         
##  [5] "fitted.values" "assign"        "qr"            "df.residual"  
##  [9] "na.action"     "contrasts"     "xlevels"       "call"         
## [13] "terms"         "model"
##                  Df    Sum Sq   Mean Sq F value Pr(>F)    
## WorkPlurality     6 2.642e+12 4.404e+11    6558 <2e-16 ***
## Residuals     69564 4.671e+12 6.715e+07                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 3589 observations deleted due to missingness
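
The report does not show how the plurality factors were built. A minimal sketch, assuming the plurality group of a tract is simply the category with the largest share, is given here on simulated data; the column names and the simulated income effect are hypothetical.

```r
# Hypothetical tract-level shares (percent); column names are assumptions
set.seed(3)
n <- 600
shares <- data.frame(
  White    = runif(n, 0, 100),
  Black    = runif(n, 0, 60),
  Hispanic = runif(n, 0, 60)
)

# Assumed definition: the plurality group is the column with the largest share
idx <- max.col(as.matrix(shares), ties.method = "first")
plurality <- factor(names(shares)[idx])

# Simulated income whose mean depends on the plurality group
income <- 25000 + 4000 * (plurality == "White") + rnorm(n, sd = 9000)
anova_dat <- data.frame(IncomePerCap = income, EthnicPlurality = plurality)

# One-way ANOVA of income on the plurality factor
fit <- aov(IncomePerCap ~ EthnicPlurality, data = anova_dat)
summary(fit)

# Pairwise group comparisons (not part of the original output)
TukeyHSD(fit)
```

The F test only says that at least one group mean differs; the pairwise comparisons indicate which groups drive the difference.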

Chi Squared Tests

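Each cell shows the observed count followed by the expected count in parentheses.

| Work | Asian | Black | Hispanic | Native | White | Total |
|---|---|---|---|---|---|---|
| Construction | 0 (18) | 31 (104) | 639 (137) | 1 (3) | 358 (767) | 1029 (1029) |
| Error | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) |
| Office | 128 (192) | 1562 (1087) | 2197 (1424) | 17 (30) | 6811 (7982) | 10715 (10715) |
| Production | 19 (70) | 480 (398) | 993 (521) | 6 (11) | 2422 (2920) | 3920 (3920) |
| Professional | 927 (851) | 2345 (4803) | 2506 (6294) | 114 (131) | 41474 (35287) | 47366 (47366) |
| SelfEmployed | 0 (0) | 0 (1) | 1 (2) | 0 (0) | 11 (9) | 12 (12) |
| Service | 240 (174) | 2744 (983) | 3273 (1288) | 49 (27) | 3388 (7222) | 9694 (9694) |
| Unemployment | 0 (8) | 257 (43) | 113 (56) | 15 (1) | 39 (316) | 424 (424) |
| Total | 1314 (1314) | 7419 (7419) | 9722 (9722) | 202 (202) | 54503 (54503) | 73160 (73160) |

χ2=NaN · df=28 · Cramer’s V=NaN · Fisher’s p=0.000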
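
The NaN values for χ2 and Cramer’s V are consistent with the all-zero "Error" row: its expected counts are 0, so its contribution to the statistic is 0/0. The sketch below re-enters the observed counts from the table, reproduces the NaN, and shows one possible remedy, dropping empty rows before testing. The object names are illustrative, and the Cramer’s V formula is the standard one rather than anything taken from the original code.

```r
# Observed counts from the table (rows: work category, columns: ethnicity)
obs <- matrix(
  c(  0,   31,  639,   1,   358,   # Construction
      0,    0,    0,   0,     0,   # Error
    128, 1562, 2197,  17,  6811,   # Office
     19,  480,  993,   6,  2422,   # Production
    927, 2345, 2506, 114, 41474,   # Professional
      0,    0,    1,   0,    11,   # SelfEmployed
    240, 2744, 3273,  49,  3388,   # Service
      0,  257,  113,  15,    39),  # Unemployment
  ncol = 5, byrow = TRUE,
  dimnames = list(
    Work = c("Construction", "Error", "Office", "Production",
             "Professional", "SelfEmployed", "Service", "Unemployment"),
    Ethnicity = c("Asian", "Black", "Hispanic", "Native", "White"))
)

# With the all-zero row, expected counts of 0 make the statistic NaN
with_empty <- suppressWarnings(chisq.test(obs))

# Drop rows and columns that contain no observations, then retest
obs2 <- obs[rowSums(obs) > 0, colSums(obs) > 0]
res <- suppressWarnings(chisq.test(obs2))

# Cramer's V = sqrt(chi2 / (n * min(rows - 1, cols - 1)))
cramers_v <- sqrt(unname(res$statistic) / (sum(obs2) * (min(dim(obs2)) - 1)))
res
cramers_v
```

Several expected counts remain below 5 (the SelfEmployed row in particular), so the chi-squared approximation is still questionable for those cells; merging sparse categories is one option.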

Regression

Seventeen variables from the dataset were chosen for the regression. Measures such as R2, adjusted R2, Mallows' Cp, the Bayesian Information Criterion (BIC), and the residual sum of squares (RSS) were used to select the variables that best fit the model. To begin, each measure was plotted against the number of variables.

Exhaustive Search

In the distribution graph, almost all the variables were included, which was not helpful. If we were to select the best variables, they would be ‘WHITE’, ‘Native’, ‘Asian’, ‘Professional’, ‘Office’, ‘Construction’, ‘Production’, ‘Unemployment’, ‘Fiscal Policy’, ‘Personal Freedom’, and ‘service’, which give the highest R2 value of .68. The highest adjusted R2 is also .68; the selected variables are the same except that ‘service’ is excluded. For BIC and Cp, the lowest values are 12 and 10, obtained when 12 and 8 variables, respectively, are included in the regression.

## Reordering variables and trying again:

##  [1] "np"        "nrbar"     "d"         "rbar"      "thetab"    "first"    
##  [7] "last"      "vorder"    "tol"       "rss"       "bound"     "nvmax"    
## [13] "ress"      "ir"        "nbest"     "lopt"      "il"        "ier"      
## [19] "xnames"    "method"    "force.in"  "force.out" "sserr"     "intercept"
## [25] "lindep"    "reorder"   "nullrss"   "nn"        "call"

## [1] 14

## [1] 13

## [1] 12
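
The criterion comparison described above can be sketched with `regsubsets()` from the `leaps` package, which the printed component names suggest was used. The data here are simulated with six hypothetical predictors `x1` to `x6` in place of the seventeen real ones, so the selected sizes will not match the report.

```r
library(leaps)  # provides regsubsets()

# Simulated stand-in data: only x1, x2, and x3 truly affect y (hypothetical)
set.seed(4)
n <- 500
X <- as.data.frame(matrix(rnorm(n * 6), ncol = 6,
                          dimnames = list(NULL, paste0("x", 1:6))))
X$y <- 2 * X$x1 - 1.5 * X$x2 + 0.5 * X$x3 + rnorm(n)

# method can be "exhaustive", "forward", "backward", or "seqrep"
fit <- regsubsets(y ~ ., data = X, nvmax = 6, method = "exhaustive")
s <- summary(fit)

# Model size preferred by each criterion
best <- c(adjr2 = which.max(s$adjr2),
          cp    = which.min(s$cp),
          bic   = which.min(s$bic))
best

# Variables in the BIC-preferred model (first column is the intercept)
names(which(s$which[best["bic"], -1]))
```

Plain R2 is left out of `best` because it never decreases as variables are added, so it always favors the largest model.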

Forward Selection

R2 is not used as a criterion because it generally increases with the number of variables, which leads to overfitting. For adjusted R2, the best number of variables could be chosen from the range of 4 to 10. The best variables are ‘WHITE’, ‘Professional’, ‘Unemployment’, ‘Personal Freedom’, and ‘service’, which give the highest R2 value of .68. For BIC and Cp, the lowest values occur at 12 and 10 variables, respectively.

## Reordering variables and trying again:

Backward Selection

Backward selection was run next (nvmax=17 and nbest=2). The best variables could be chosen from the range of 9 to 11, and they are ‘WHITE’, ‘Native’, ‘Asian’, ‘Professional’, ‘Office’, ‘Construction’, ‘Production’, ‘Unemployment’, ‘Fiscal Policy’, ‘Personal Freedom’, and ‘service’. For BIC and Cp, the lowest values occur at 12 and 10 variables, respectively.

## Reordering variables and trying again:

Sequential Replacement (seqrep)

Lastly, sequential replacement was run. How accurate and precise are these models? That is not yet known; answering it requires a validation-set approach or cross-validation.

## Reordering variables and trying again:
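
As a rough illustration of the cross-validation step mentioned above, the following base R sketch estimates the out-of-sample RMSE of two candidate linear models with 10-fold cross-validation. The data and the helper `cv_rmse` are hypothetical and are not part of the original analysis.

```r
# Hypothetical data: y depends on x1 and x2; x3 is pure noise
set.seed(5)
n <- 400
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
dat$y <- 3 * dat$x1 - 2 * dat$x2 + rnorm(n)

# Return the k-fold cross-validated RMSE of a linear model
cv_rmse <- function(formula, data, k = 10) {
  folds <- sample(rep(seq_len(k), length.out = nrow(data)))
  response <- all.vars(formula)[1]
  sq_err <- numeric(0)
  for (i in seq_len(k)) {
    train <- data[folds != i, ]
    test  <- data[folds == i, ]
    fit   <- lm(formula, data = train)
    sq_err <- c(sq_err, (test[[response]] - predict(fit, newdata = test))^2)
  }
  sqrt(mean(sq_err))
}

rmse_small <- cv_rmse(y ~ x1, dat)
rmse_full  <- cv_rmse(y ~ x1 + x2, dat)
c(small = rmse_small, full = rmse_full)
```

Applied to the subsets chosen by each search method, the same loop would show whether the larger models actually predict held-out tracts better than the smaller ones.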

Conclusion